Exploratory data analysis (EDA)
In [5]:
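import pandas as pd
import wandb

# Start a W&B run for the EDA step; save_code=True stores the notebook code with the run
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)

# Fetch the latest version of the sample.csv artifact and load it into a DataFrame
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)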
Tracking run with wandb version 0.16.0
Run data is saved locally in
/mnt/c/Users/Tania/Desktop/mlops-project2/build-ml-pipeline-for-short-term-rental-prices/src/eda/wandb/run-20240112_210047-tywu74ul
View project at https://wandb.ai/tania-m/nyc_airbnb
General data profiling
In [10]:
# pandas_profiling was renamed to ydata_profiling, so import from the new package
from ydata_profiling import ProfileReport

# Build the profiling report and render it inline in the notebook
profile = ProfileReport(df, title="Profiling Report")
profile.to_notebook_iframe()
Data fixes
In [7]:
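# Drop price outliers: keep only rows with min_price <= price <= max_price
# (Series.between is inclusive on both ends by default)
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# Convert last_review from string to datetime
df['last_review'] = pd.to_datetime(df['last_review'])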
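As a quick sanity check of the two fixes above, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from sample.csv; the names `toy`, `keep`, and `cleaned` are hypothetical. It shows that the price filter keeps both boundary values and that a missing `last_review` becomes `NaT` after conversion.

```python
import pandas as pd

# Toy data with hypothetical values (not from sample.csv)
toy = pd.DataFrame({
    "price": [5, 10, 120, 350, 9999],
    "last_review": ["2019-05-21", None, "2018-11-03", "2019-07-01", "2019-01-15"],
})

# Same filter as in the cell above: between() includes both bounds by default
keep = toy["price"].between(10, 350)
cleaned = toy[keep].copy()

# Same conversion as above: missing values become NaT
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])

print(cleaned["price"].tolist())            # prices that survive the filter
print(cleaned["last_review"].isna().sum())  # number of missing review dates
```

Taking `.copy()` after the boolean filter means the later column assignment writes to an independent DataFrame, which avoids pandas' SettingWithCopyWarning.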
In [8]:
run.finish()
W&B sync reduced upload amount by 5.5%
View run rural-eon-9 at: https://wandb.ai/tania-m/nyc_airbnb/runs/tywu74ul
View job at https://wandb.ai/tania-m/nyc_airbnb/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEyOTgwOTk4NQ==/version_details/v4
Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 1 other file(s)
Find logs at:
./wandb/run-20240112_210047-tywu74ul/logs